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I. THE EEPISITIOH OF THE PROBIEH 



i. a FIEST IHSIGHT 

"A systolic system is a network of processors which rhyth- 
mically compute and pass data through the system. 
Physiologists use the word "systole” to refer to the 
rhythicically recurrent contraction of the heart and 
arteries which pulses blood through the body. In a 
systolic computing system, the function of a processor is 
analogous to that cf the heart. Every processor regularly 
pumps data in and cut, each time performing some short 
computation, so that a regular flow of data is kept in the 
network. ” 

"...matrix computations can be pipelined elegantly and 
efficiently on systolic networks having an array 
structure. " 

"...special purpose hardware devices based on systolic 
arrays can be built inexpensively using the VISI 
tecbnclogy- " 

Ihese are parts of the abstract of the paper where 
Professors H.T. Rung and C.E, Leiserson have first 
presented the concept of systolic arrays £Bef. 1]. They 
fully describe the concept and context to which the term 
"systolic array” applies. 

Several authors have presented papers on systolic arrays 
in recent years, but this subject is very new and everything 
cates tack to 1978. 

B. S7STCLIC ARRAYS; IBE BASIC PRINCIPLE 

A systolic system consists of a set of interconnected 
cells, each capable of performing some simple operation. 
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Information flows between cells in a pipelined fashion, and 
communication with the outside world occurs only at 
"boundary cells". 

Ihe usual computational tasks are basically divided into 
two categories — computing bound and input/output {I/O) 
bound. If the tctal number of computing operations is much 
larger than that of I/O operations, then we have a computing 
bound task. Systolic arrays are useful to speed up this 
type cf task. The array is able to use each input element a 
number cf times thus achieving a high computational 
throughput without increasing memory bandwidth. [Bef. 2] 

C. DIVISEB AECHITECIDBES 

A number of architectures have already been proposed for 
solving the traditional problems. These are linear arrays 
for performing matrix-vector multiplication and finding 
solution of triangular linear systems; two-dimensional 
arrays for matrix-matrix multiplication/addition, least 
squares solution and ID factorization, etc [Eef. 3]- 

E. HABBABE EXPLOBATCBY DEVElOiMENT 

Because this subject is new, much of the development is 
still in an exploratory stage. In order to further evaluate 
the hardware potential of systolic arrays, TEW’/ESl has 
developed a processor called Phoenix that has already been 
used to provide two-dimensional FIR filters for image 
processing [Bef. 3; page 3]. To evaluate systolic algo- 
rithms and architectures, the Daval Ocean System Center has 
developed a reconf igurable one-two dimensional systolic 
array with 64-processing elements [Bef- 4]. This systolic 
testbed processor can be reconfigured under software control 
to perform as a rectangular, hexagonal, or linear systolic 
array. Ihe main goal of its design was to achieve 
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flexiiility in algorithm evaluation, and its cells are much 
slower than those used in the TEH/ESL Phoenix processor or a 
possille implementaticn in a VISI chip. 

E. ■RHI HOT TO EXPLOBE IT WITH SOFTIARE? 

During the preliminary investigations in this field, the 
approach was to select a case study and try to go deeper in 
the process of understanding the way a particular algorithm 
is mapped onto a systolic architecture. It was discovered 
that it is not always straightforward to understand the 
mapping process. An algorithm like the one proposed by H.T. 
Fung and W.M. Gentleman for doing matrix triangularization 
can he really difficult to understand if one dees not 
possess an adequate tool [Ref. 7: page 22]. The algebraic 
manipulation can be difficult when dealing with systolic 
arrays because the interaction of the processors tend to 
lead to a very complicated set of equations in only a few 
steps and it becomes difficult to achieve any generaliza- 
tion. Seme notations have been proposed to help in the 
modeling the parallel processing £Ref. 5], but no systematic 
method for mapping an algorithm onto an array exists. There 
is a lack of the appropriate tools for systolic algorithm 
development. However, with the development of graphics 
technology a means now exists that can be used to help 
people better understand how a systolic array works. If we 
conjugate this possibility with interactive programming 
techniques, it may be possible provide an user friendly tool 
that can be used tc achieve the goal which is the main 
guideline of this investigation: to provide an easier way 
to understand systolic array operation. A case study will 
he presented in Chapter III that turned towards performing 
Matrix Triangularization with the Givens Rotations 
Algorithm. Eventually, the systolic processing can be 



understood and checked against real data with the use of our 
Systolic Array Graphics Simulator (SYSGRAS) . This approach, 
in ccntiast with that followed by researchers at the 
M.O.S.C., in San Diego, is to take full advantage of the 
software, using computer graphics presentation to improve or 
to add another dimension to the man/machine interface. 

Anyway we want to point out that this work must be seen 
as an investigation of the possibilities of using this kind 
of facilities as tools rather than as a final product. Only 
experimentation in practical work can show the advantages or 
the limitations of this approach. As in any other area of 
engineering, feedback from users is important in the evalua- 
tion and future imprcvement of this tool. 
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II. IHE S IMPLAIOB APPROAC H 



lie Systolic Array Graphical Simulator (SYSGRAS) is 
implemented to allow interactive design of any type of 
systolic array. The cells and their interconnection links 
are defined hy the user in a guided manner. The optional 
cell types are built into the SYSGRAS. The user can 
construct any array using the existing cell types or by 
introducing new cell types, tut the introduction of a new 
cell type cannot be done interactively. This reguires a 
subroutine to describe and support that type, but the opera- 
tion is simple and will be explained later. The presently 
available types can support most of the algorithms found in 
the literature. 

SYSGRAS is designed with a requirement of being user 
friendly. It has been realized that the amount of data that 
is normally required to be entered by a user in a session is 
quite large. It is important to offer the possibility of 
correcting errors, reviewing entered data, and storing data 
and results from the session. It can also be used later in 
another session. The sequence of operations required during 
a session is shown below: 

1) Establishment of general conditions: 

new or previous problem to run. 

need to store the calculations for later 
printing. 

ability tc interrupt or review the progress of 
systolic processing on a clocked basis. 

2) Definition of cells* attributes: 

type of processor in the cell, 
graphical shape of the cell, 
screen coordinates of the cell. 
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interfaces (importer cell, receiver of external 
data; internal cell, only exchanges data with 
other cells; exporter cell, data transmitter 

to external world) . 

3) Definition of links; 

define the origin and destination of each link. 
The origin is a port in a cell. Each cell has a 
number of output and input ports. The function 
of each cne depends on the processor type. 
The destination is a input port in another 
cell. A link is defined by origin cell 

output port and the destination cell input 

port. 

4) Presentation on the graphical screen; 

the pictvre of the user defined array is 
initially presented on the screen with all 
processor registers initialized to zero. Since 
the presentation is strongly based on colors, 
the non existence of data interaction in the 
cells makes them all turn into black. The 

identification number given by the user to 
each cell is placed at the cell’s lower right 
corner and red arrows indicate the linking 
between cells and the direction of data flux 
(see Fig. 4.7 at Chapter IV). 

5) Data input and screen update: 

as soon as data originated from the external 
world to the importer cells is entered, the 
whole screen is updated and representing the 
status of the systolic array at that partic- 
ular clock cycle. The external data are entered 
at each clock cycle, and the screen is updated 
accordingly. 



17 



6) Review of the ccmplete sjstolic process; 

optionally, the user can review the complete 
process. The tctal data generated in a 
session are stored automatically, and the 
graphics presentation of any previous session 
can be seen again without any additional 
data entry. 

The central idea of the STSGRAS is to allow a user to 
construct any desired systolic array and to use as the input 
test data such that insight can be gained about the data 
interactions of the systolic algorithm. The user has the 
capability to correlate the propagation of the input data 
with the colors of the cells to better understand the 
process. Certainly, the greater the creativity of assigning 
the color, the better the result showing the interactions. 
There is no restriction as to how to do the correlation and 
several examples will be presented. The colors that are 
available to a input datum are red, green and blue. If a 
cell receives data from more than one source with different 
primary colors, these colors will combine. The cell will 
show on the screen with a resulting color that is the combi- 
nation of the input primary colors. This resulting color is 
not passed on to other cells. The propagation of data is 
associated with the primary colors, and this way a good 
tracking of the timing that takes for each data wavefront to 
propagate through the array can be achieved. The screen 
display shows the array in a way selected by the user. The 
cells are presented with the shape, position, identification 
and connections requested by the user. The colors and numer- 
ical results assumed in each clock cycle show the data 
interaction. The numerical results that appear on the 
screen is in a fixed format and with rather limited preci- 
sion. If one wants greater precision, there is an option to 
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record the session which can show results with up to five 
decimals on printer output. These results can he checked 
against the known values to verify the correct operation of 
the mapped systolic algorithm. 
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III. A CASE STUDY : HATRIX THIAN GUIARIZATI OH BY GIVE NS 

ROT AT IONS 

A. A REASON FOR MATRIX TRIAHGDLARIZATION: QR DECOMPOSITION 

The technique of CR decomposition is useful to solve the 
linear least Squares Problem. There are other possible 
approaches in the literature to solve this type of problem 
[Ref. 6]. We will select the QR Decomposition Method as 
mentioned by Gentleman and Rung in [Ref. 7: page 19]. let's 
present a brief explanation of the method. 

Given a linear system defined by 

AX = E 

where A is a (n x p) matrix, X is a (p x 1) and B is a (n x 
1) column vector. We want tc find the vector X such that 

I jr 1 I =1 1 E-AX| I is minimized. The vector X is also called 
vec t or of regression c oefficient s. This is in fact a vector 
of estimated elements and as such it strictly should be 
written X but for convenience X will continue to be used. 
Also a mere rigorous matrix notation would be, for example, 
A instead of A. If we are able to find an orthogonal matrix 
Q such that QA=E where R is an upper triangular matrix, then 
we will have 



QAX = BX = QB 
and, as a result, RX = QB. 

As E, Q, and E are all known, we can find X by Back 
Substitution. Gentleman has shown, as seen in [ Ref. 8: page 
329], that this approach solves the Least Squares Problem 
because it solves the normal equations of the problem 
[Ref. 6: page 148]. 
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Ad accurate QR DecompositioD can be obtained in several 
ways, like the Gram- Schmidt Process or Householder 
Transformations. For this particular study we will stick to 
the Givers Rotations Cethod. This method requires the use of 
a sequence of plane rctations on matrices A and 3, that will 
convert the linear system to a representation of the form 

RX = CB 

which is straightforward to solve. 

The Givens Rotations Method has two different implemen- 
tations [Ref. 8: page 331]. We will study the one called 
"with square roots". 

E- QB DECOMPOSITION EY GIVENS ROTATIONS WITH SQOARE HOOTS 

If Q is an orthogonal matrix, Y=QA is called an orthog- 
onal transformation and the columns of Q, say q{1), g{2), 
... are orthogonal. In order to better understand what an 
orthogonal transformation means, we can interpret the q(ij’s 
as unit vectors (because of their unit length) that repre- 
sent the new reference frame in the original coordinates 
frame. If we apply an orthogonal transformation to a matrix 
(or to a vector) , we will be in fact changing the reference 
coordinates for that vector. The information content of the 
transformed matrix (or vector) will be the same, although 
different from previous form because of the new reference. 
The columns of Q are given b^ the dir ecti on co^nes of the 
ne w ref ere nce a xes with respec t to the old refere nce s ystem 
[fief. 12: page 354]. 

An orthogonal transformation can be divided into a 
sequence of elementary transformations. In the case of 
Givens Rctations, each one of these elementary transforma- 
tions is a rotation about one coordinate axis of the orig- 
inal reference frame (which may be n-dimensional) . 
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In geometry we knew that the transformation 
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applied to a vector [a1^ bl, cl] will rotate the original 
reference frame about the 3 axes at the same time. Ihe 
direction cosines of the new x-axis are cos(xl), ccs{x2) and 
cos(x3); the direction cosines of the new y-axis are 
cos(yl), cos (y2) and cos{y3), and so on. The transformation 
effect achieved by the matrix of cosines is similar to that 
obtained by the Givens Rotation transformation matrix Q.r 

Re can split the cosines matrix into a product of 
matrices each one representing a rotation about one axis of 
the original frame. This technique of decomposing the oper- 
ation into product operations, each one being responsible to 
rotate the system matrix of an angle TETA about one axis of 
the original frame, is exactly what is done in Givens 
algorithm. Each individual Givens’ Rotation is represented 
by a matrix of the form 
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where sgr(c) ♦ sgr (s) = 1, c = cos (TETA) , s = sin (TETA) and 
dots represent zeros [Bef. 6: page 153]. This elementary 

rotation will convert the (j,i) element of the matrix which 
it premultiplies intc zero, that is, the (j,i) element of 
the product of matrix D(j,i) and A will be 

-s.a{i,i) + c.a(j,i) = 0 

and sc it follows that 

d = sgrt (sgr (a (i,i) ) + sgr(a(j,i))) 



c = a(i,i)/d and s = a(j,i)/d 

If the matrix A is (n x n) , it will be necessary to have 
(n-1) elementary rotations to turn all elements of the first 
column ( with exception of a (1,1) ) onto zero, (n - 2) rota- 
tions for the second column, and so on. For a (3 x 3) 
matrix, three rotations will suffice. 

C. HAPPING GIVENS HCTATIONS AIGOBITHM INTO A STSTOIIC AREAY 

The problem of mapping an algorithm into a systolic 
structure freguently turns out to be the subject of certain 
restrictions because the computed results may result in very 
small numbers impossible to be represented if fixed point 
hardware is used. In crder to avoid establishing undesirable 
conditions on the data, one must try to select an algorithm 
possible to be mapped and to perform computations cn any 
data within the selected range of approximations. 

Ne have already presented the Givens Algorithm. The 
problem now is to understand how a systolic array can 
realize it. The reader must be aware that the understanding 
of the mathematical algorithm does not lead to immediate 
insight cn how the array works. For the original discussion 
on this algorithm the reader is referred to [Ref. 7]. 
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Presented in the following are the specifications of the 
cell processors used in this systolic array, the general 
cell arrangement, and the input data streams. However, the 
sight of such specification does not reveal the intricacies 
of the deduction of which function should be performed by 
each particular cell element. A possible way to gain 
insight as to how the algorithm is mapped onto the array is 
to perform an algebraic validation that is extremelly labo- 
rious. For a start, suppose that we want to triangularize a 
(3 X 3) matrix A. We must premultiply it 3 times by elemen- 
tary rotation matrices which will turn the elements a (2, 1) , 
a (3,1) and a (3,2) into zeros. The first transformation will 
be 
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This elementary rotation will turn element (2,1) into null 
on the product matrix. We can note that the element (-£) at 
the transformation matrix occupies the position of the 
element that will be turned into zero in the transformed 
matrix. The transformed matrix will become 

' a* (1,1) a* (1,2) a’ (1,3) ‘ 

0 a* (2,2) a’ (2,3) 

. a(3,1) a (3,2) a (3,3) . 

which will be operated by a second transformation matrix 



c 


0 


S 




■ a’ (1,1) 


a* (1,2) 


a* (1,3) ■ 


0 


1 


0 


• 


0 


a’ (2,2) 


a’ (2,3) 




0 


c 




. a(3,1) 


a (3,2) 


a(3,3) . 



and that will generate the matrix 
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• a«n,1) a’*(1,2) a»(1,3) ' 

0 a* (2, 2) a* {2,3) 

0 a' (3, 2) a’ (3,3) 

Finally the last transformation 
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will generate the upper triangular matrix 

■ a" (1,1) a” (1,2) a«(1,3) ' 

0 a” (2,2) a" (2,3) 

. 0 0 a" (3,3) . 

It is important to note that the s*s and c's are 
different in each transformation matrix and are calculated 
the way we have already pointed out, that is, for the first 
elementary transformation, 

d = sgrt (sgr (a (1, 1) ) + sgr(a(2,1))) 

c = a(1,1)/d and s = a(2,1)/d 

How the evaluation of the elements of each transformed 
matrix should be considered by performing the actual multi- 
plications to correlate the resulting algebra with the algo- 
rithmic processor’s eguations presented in Figs. 3.1 and 
3.2. For this particular algorithm it reguires definitely a 
lot of paperwork. This cannot be accepted as a good method 
if one wants to understand the general data flow in the 
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systolic array, and to check its numeric data. This is why 
we developed the SYSGRAS to have the possibility of doing an 
evaluaticn study on how several algorithms found in the 
literature do perform. 

D, A NDHEEICAL EXAHEIE 

In order to test the Givens Algorithm in doing Matrix 
Triangular ization, we have prepared a numeric study case. A 
system is represented by the matrix equation 
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In Appendix C we shew how to operate on this equation for 
triangulating ■ and to solve it by Back Substitution. Using 
these numerical data, we will set up the problem fer the 
simulator and be able to check out the simulation results 
against our hand calculated values. 

E. SETTING UP THE PECBLEM FOR SIHULATIOM 
1 • Sketch ing the Graphics 

The first step in preparing the problem for simula- 
tion consists of planning the graphics. He have to consider 
the following points: 

- How do we want our systolic array arranged on the 
screen? 

- What kind of shape will be used for the cells (in 
SYSGRAS we have two options; square and octogcnal)? 

The screen coordinates for the cells (SYSGRAS divides 
the screen into a chessboard, and therefore we have a 
span of 8x8 positions to place the array cells) . 
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Which are the input and output ports? This is depen- 
dent on the subroutine that implements the cell 
processor, and so must be compatible with the defini- 
tion of the processor cell subroutine. 

Which links are used to connect the different ports of 
the different cells? We need to identify everything so 
that the correct information at SISGRAS can be 
entered. 

This may be better understood with an example, and 
we will continue the development of the systolic array to 
perform Matrix Triangularization. 

Por this particular case, we will use three types of 
processors: a Givens External cell, a Givens Internal cell 

{both with square roots) , and a Buffer cell. The last one is 
used to help the visualization of the input data array being 
pumped into the systolic array. 

Figures 3.1, 3.2, and 3.3 show which ports are 

considered the active terminals in the actual implementation 

I 

4 

if * 3 O ttnen 
besi a 
c = 1 • O 
s s 0.0 
end 

else be9in 

temp = V'r* 

C = r / t ®tnp 
5 ^ i / temp 
r = temp 

end 

Legend: r - residue i 




Figure 3.1 Givens External Processor Cell 
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Figure 3,2 Givens Internal Processor Cell 
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Figure 3.3 Buffer Cell 
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Figure 3-4 Systolic Array for Matrix Triangularization 



in SYSGRSS for these cell processors. Fig. 3.4 show how to 
arrange the cells and their interconnections. It is impor- 
tant to realize to the fact that although circular shaped 
cells are presented in Figs. 3.1 and 3.4, they are imple- 
mented in the SYSGRIS as octogonal shaped cells. This 
approach was adopted to improve the computational speed of 
the simulator. However, it has been decided to keep the 
standard circular shape in Figs. 3.1 and 3.4 . An addi- 
tional detail for the reader is the fact that cells 10 to 13 
in Fig. 3.4, buffer cells, do not show exactly the same way 
as Fig. 3.3 does. This is because a buffer cell is imple- 
mented in SYSGRAS with three channels, but for Matrix 
Triangularization only one channel is used (and so only one 
output port is shown in Fig. 3.4). Attention should also 
be paid to the fact that the arrows shown in Fig. 3.4 repre- 
sent the links that have actually been used in our simula- 
tion. Cells number 1,2 and 4 of that Figure are of the same 
type of the other sguare shaped cells that appear in the 
same Figure, although there is no arrow drawn on their right 
side. The reason is that a third order system is used in the 
simulaticn and so there is no need to extend the limits of 
the array beyond those cells. 

This type of preparation that has been presented 
here is very helpful when entering the data because there is 
a large amount of information asked by SYSGRAS to be entered 
interactively. If every detail is planned beforehand, the 
interaction between the user and the simulator becomes 
faster and easier. 

2 . Pla nni ng the Inpu t D^a Arr ay 

Cnee the systolic array has been planned and 
sketched, the user must prepare the test data that exercise 
the simulator. This is pehaps the most important step in 
the simulation. At this point the only rule to follow is 
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Figure 3.5 Input Data Array Arrangement 
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that a time shift as required by the algorithm has to be 
respected. The way cclor information will be assigned to the 
entering data is completely up to the user, and the greater 
his/her creativity, the better the simulation effect, and 
consequently, the easier to interpret the results. 

Fig. 3.5 shows a possible way to prepare the data. 
¥e can identify in that figure the numbers from our numeric 
example referred in last subsection. The time shift that can 
te seen is necessary to provide the necessary synchronism 
for the systolic processing. In some algorithms, the output 
data to the external world must also be collected in a 
synchronous fashion. But, for the particular case of Matrix 
Triangularization, as soon as the final result is achieved, 
each datum is frozen in its cell and waits for a pumping out 
operation whose connection links are not being considered in 
our simulation. 
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IV. THE AM AIYSIS OF THE SIM OLA T IONS 



A. DIVIDED DIFFEBENCIS 

Before going into detailed analysis of the Matrix 
Triangulari 2 ation sone simple algorithms are presented. 

In numerical interpolation, we need to compute the 
divided differences from a set of points and then fcrm a 
polynomial from these divided differences. The problem of 
mapping the algorithm for calculating these differences into 
a systolic array structure has been recently addressed in 
[Ref- 10]. In that paper, Li and Smith propose a systolic 
architecture consisting of triangular cells. Some upward, 
some downward, according to that paper, have the same' 
internal architecture but must perform differently depending 
on their orientation with respect to the data flow. 
Certainly, this poses the problem of sensing the direction 
of the data flow and implementing some additional logic in 
the cells to change their function accordingly. Li and Smith 
also discuss some implementation problems in their paper, 
and point out that their actual cell "is not exactly the 
same as described" [Ref. 10: page 542]. 

Their algorithm is implemented in SYSGRAS, but the basic 
cell has teen modified; grouping one upward triangle and one 
downward triangle into the same ceil. It is noticed that the 
upward cell was in fact actuating just as a traf fic-orientor 
buffer and no additional information was being incorporated 
into the data flow. It wouldn't be cost effective to have 
that kind of processor since the introduced complexity was 
due to the need to establish a homogeneous type of architec- 
ture based on triangles. Therefore, our basic cell became 
that shown in Fig. 4.1, and the systolic architecture. 
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Figure 4,1 Divided Differences Cell Processor 

almost the same as that proposed by Li and Smith, appears in 
Fig. 4.2. 

We have investigated the problem of finding the divided 
differences of the set of points (1.0, 1.0), (1.8, 2. 2), 

(3. 0,3. 3), (4.3,3. 5) and (5.3,4. 1) in the x-y plane. 

According to [ Bef . 10: page 539], the first level divided 
differences are 

yMI) = (y{2) -y (1))/(x(2) -x(1) ) = 

= (2.2-1. 0)/ (1.8-1. 0) = 1.5000 
y’{2) = (y(3)-y(2))/(x(3)-x(2)) =0.91666 
y’(3) = (y(4)-y |3))/(x(4)-x(3)) = 0. 15385 
yMi<) = (y (5)-y(4))/(x(5)-x(4)) = 0.6000 
Continuing with the calculations , the second level 
divided differences are 

-0.29167 
-0. 30512 
0.19398 
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Figure 4.2 Divided Differences Array after Clock Cycle 1 
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Figure *1.3 Divided Differences Array after Clock Cycle 2 
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Figure 4.4 Divided Differences Array after Clock Cycle 3 
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Figure 4,5 Divided Differences Array after Clock Cycle 4 
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and the third level divided differences are 



-0.00408 

0.14260 

and the fourth level is 0.03411. This suggests a kind of 
pyramid structure that can be seen from Figs. 4.2 up to 
4.5. The calculation of the differences is performed in 
four-clock cycles, one represented in each figure. The 
bottom rov displaying the first order differences, the upper 
row, the second order and so on till the top cell. We have 
related each point (P1-P5) to a primary color, that is. Pi 
is blue, P2 is red, P3 is green, P4 is blue, and P5 is red. 
In Pig. 4.2 we can see the result of the first step of 
calculations, with the mixing of these colors at the bottom 
cells. Going through each step and comparing them with the 
mathematical formulation, we get a better understanding of 
the data interaction. The the final results in Fig. 4.5 
should be compared with the above calculated values. 

B. HATEIX-HATRIX HDITIPLICATICH 

This is another simulation problem that have been tried 
on SYSGEAS. The presentation of the algorithm has teen dene 
at [Bef. 1: page 9]. Presented in Fig. 4.7 is the arrange- 
ment cf the systolic array and how the data is pumped into 
it. The elements of matrix A can be seen flowing towards the 
lower right and those cf matrix 3 flowing towards the lower 
left. The resultant matrix C is pumped upwards. We have 
performed a simulaticn to compute the matrix product AB=C as 
below 
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The scheme used for systolic computation has an advan- 
tage due to the fact that a great number of matrices that 
are used in typical problems are band matrices [Bef. 9]. 
Consequently, the systolic array does not need to include as 
many cells as it would have to for the case of full 
matrices. In this numeric example, in order to restrict cur 
array to a small size so that it can be seen on the screen 
(as in Fig. 4.8), we decided to set elements a (1,3) and 
b(3,1) tc zero. This way, a systolic array of small dimen- 
sions can multiply larger matrices with restricted nonzero 
band. 

A single type of cell is used in this algorithm. It is 
called Inner Product Step Processor. Its geometry and algo- 
rithmic definition are shown in Fig. 4.6 . This same type of 
cell will be presented later in another algorithm 
implementation. 

In this simulaticn problem, different colors are attrib- 
uted to different rows of matrix A and to different columns 
of matrix B. As we know, each element of the product matrix 
C will be generated by a combination of one row of A and one 
column of 3. When these elements join each other in a 
systolic cell, we will be able to identify exactly that 
combination taking place at each cell in the space-time 
frame. He present here the coding of colors that was 
adopted ; 
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Figure 4.6 loner Product Step Processor Cell 



To show how colors can help in understanding the mecha- 
nism of irultiplicaticn in the systolic array, we will track 
the generation of the element c(3,2) • fJe see from the above 
color coding that c(3,2) is a yellow element, since it 
results from the ccirbination of a green row and a red 
column. From mathematics, 

c(3,2) = a(3, 1) .b (1,2) + a ( 3, 2) . b (2, 2) a {3 , 3) . b (3 , 2) 

Figure 4.7 shows a schematic diagram that corresponds to 
clock cycle number 1, the same cycle is also shown in 
Fig. 4.8, in which elements a (1,1) and b(1,1) have entered 
cells nuitbers 14 and 15 respectively. From Fig. 4.7 it can 
also be seen that the element a (3,1) (green element) will 
enter into the array (at cell 06) at clock cycle 3 (see 
Fig. 4.7), while element b(1,2) (red element) will enter at 
clock cycle 2 (at cell 12) (see Fig. 4.9). After that, they 
will tun through the array and will meet each other at cell 
02 at clock cycle 5 (see Figure 4.12). This meeting gener- 
ates the first partial result a (3, 1 ) . b (1 ,2) . This can 
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Figure 4.7 Systolic Array for Matrix-Matrix Multiplication 
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Figure 4.8 iatrix-Hatrix Hultiplication after Clock Cycle 1 
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Figure 4.9 Matrix-Matrix Multiplication after Clock Cycle 2 
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Pigure 4.10 Hatrix-iatrix Hultiplication after Clock Cycle 3 
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Figure 4.11 Hatrix-Hatrix Multiplication after Clock Cycle 4 
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Figure 4.12 Matrix-Hatrix Multiplication after ClocJc Cycle 5 
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Figure 4.13 Batrix-Matrii Multiplication after ClocX Cycle 6 
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Figure 4.14 Matrix-Matrix Multiplication after Clock Cycle 7 
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Figure fl.15 Matrix-Matrix Multiplication after Clock Cycle 8 
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Pigure 4.16 Ha trix-Hatrix Hultiplication after ClocX Cycle 9 
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te verified to be 8x1=8 by our numeric example and by the 
contents of cell 02 at clock cycle 5 . This partial result 
with yellow color (identified as c(3,2) for convenience) is 
pumped upwards to cell 09. Performing similar verification, 
it can te seen that elements a (3,2), b(2,2) and that partial 
result of c(3,2) will meet at cell 09 at clock cycle 6 (see 
rig. 4.13). They will, then, generate a second partial 
result 

a (3, 1) .b (1 ,2) + a (3,2) .b (2,2) 

which is 8x1+2x7=22 as seen in Fig. 4.13 at cell 09. By 
similar reasoning, the final result 8x1+2x7+3x10=52 is 
generated in cell 14 at clock cycle 7 (see Fig. 4.14). 

It can be seen that matrix C is symmetrical with respect 
to the color coding. Symmetrical elements like C(3,2) and 
c(2,3) are computed in parallel and pumped up side by side. 
So, they are easily tracked with the help of colors, and the 
whole operation can be better visualized. 

C. MATRIX TEIANGOLABIZATION BY GIVENS ROTATIONS 

We will provide a brief review of this problem. The 
nomenclature to be used is the same as that in Chapter III. 
The reader is also referred to Appendix C, "A Numerical 
Example for Matrix Triangular! zation ”, which uses the same 
data as that described here. 

The linear system that needs to be solved is 

AX = B 

where A is a nxp matrix, X is a column vector of p elements, 
and E, a column vector of n elements. 

We will premultiply the matrix A by a transformation 
matrix Q such that the product matrix becomes an upper 
triangular matrix E. To keep the matrix equality it is 
necessary to premultiply vector B by matrix Q. This will 
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result in the product vector QB. This operation is shown 
below: 

QAX = QB 

and as E = QA, it beccmes 



EX = QB 



The Givens Eotations in this algorithm under simulation 
triangularize the matrix A and computes the vector QB 
concurrently. This is possible because 



Q (AiB) = (QA) 1 (QB) = E( (QB) 

where the operator ” 1” performs matrix concatenation. lor 
clarity, if we define 
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then A|B becomes 



A| B 



2 4 1 12 

5 7 4 3 

3 0 18 



As we have shown in Chapter III, the simulation uses the 
cell structure seen in Fig. 3.4 and the test data seen in 
Figure 3.5 (the same as that used in the numerical example 
cf Appendix C) . The picture that appears on the screen at 
the start of the simulation is shown in Fig. 4. 17 which the 
reader should compare with Fig. 3.4 in Chapter III. 
Fig. 4.17 shows the identification of each cell at its lower 
right corner. Cells numbered 13 to 10 represent the data 
wavefront to be pumped into the systolic array. They are not 
part cf the array, but only buffer cells to allow presenta- 
tion of the data to he pumped into the array on the screen. 



The final elements of vector QE are calculated in cells 04, 
02, and 01. The final elements of matrix A are calculated in 
cells numbered 09, 08, 06, 07, 05 and 03. 

As shown in Appendix C, the values for AX = B are as 
follows and AjB is as shown above. 
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The test data are pumped into the array as wavefronts 
that can be seen in cells 13 to 10 just before being entered 
(the reader can refer to Fig. 3.5 to the several wavefronts, 
each one corresponding to a different color) . There is a 
time shift among tie elements of the same row for the matrix 
under triangularization (compound matrix A|B). This 
displacement is needed to establish the correct timing for 
the systolic operation. In this simulation, we decided to 
assign a different color for each row of A|3 (as shown in 
Fig. 3.5) and to keep track of the flow of the colored 
elements through the array. This flow can help us to under- 
stand, in a sequence of space-time frames, how the whole 
array behaves. The effect of the different clock for subseq- 
uent cycles can be seen from Fig. 4.18 to 4.26 . In 
Fig. 4.18, element a(1,1) of matrix A enters into cell 09. 
T?e can see, in cell 13 element a (2,1), and in cell 12 
element a (1,2). They are ready to be clocked in at the next 
clock cycle. This happens as seen in Fig. 4.19, when cells 
09 and 08 display the stored result of the computation 
performed. Notice that cells 09, 07, and 03 actuate like a 
reflector for the data wavefront that is going downwards. 
They rotate the data flow direction by 90 degrees counter- 
clockwise. As a result of the interaction between the wave- 
fronts that go towards the right and that going downwards, 
there is a resulting wavefront that flows towards the lower 
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right. ihis is the important observation because it shows 
how the effect of the input data spreads over the array 
(this is why we decided to associate this wavefront with 
identical colors) , The wavefront flowing towards the right 
carries information about the angle that the row elements 
must he rotated. This information must arrive at each cell 
just in phase with information of the matrix element teing 
rotated, which is being transferred to each cell by the 
downward wavefront. At each clock cycle a new rotation angle 
is calculated at the inclined boundary cells (Givens 
External Cells, namely 09, 07 and 03) and transmitted tc the 
rightest cells at the same row (that is from cell 09 to 
cells C8, 06 and 04 in the upper row, from cell 07 to cells 
05 and 02 in the second row and from cell 03 to cell 01 in 
the lower row) , Partial results are . generated at each 
systolic array element at each clock cycle. At the moment 
when the downward wavefront start feeding zeros into a 
systolic array element, it freezes its content and do not 
modify it any more. He have selected the black color to 
indicate that the input data are bringing no information. 
The black wavefront flowing through the systolic array acts 
as a freezing wavefront. Figs, 4.19 up to 4.26 show the 
effect of the remaining clock cycles. In Fig. 4.26 we have 
the final result of the triangularization frozen into the 
cells. The upper triangular matrix R and the vector QB have 
teen computed. It can be checked out against those values 
shown in Appendix ”A Numerical Example for Matrix 
Triangularization". • Ihe computed results are ready to be 
used in the Back Substitution to solve the system equations. 
In order to transfer the data from the cells to the hardware 
that performs the Back Substitution operation, a special 
connection which is net shown here becomes necessary. But, 
it is not required here for the understanding of the Givens 
Eotatiens process. 
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Figure 4.17 Matrix Triangular ization Array 

Initializatron 
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Figure 4.18 Hatrix Triangularization Array 
after Clock Cycle 1 
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Pigure 4.19 Matrix Triangular ization Array 
after Clock Cycle 2 
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Figure 4.20 Matrix Triangular ization Array 
after Clock Cycle 3 
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Figure 4.21 iatrix Triangularization array 
after Clock Cycle 4 
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Figure 4.22 Matrix Triangular ization Array 
after Clock Cycle 5 
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Figure 4.23 Matrix Triangular ization Array 
after Clock Cycle 6 
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Figure 4.24 Matrix Triangular ization Array 
after Clock Cycle 7 
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Figure 4.25 Hatrii Triangular ization Array 
after Clock Cycle 8 









Tigure 4.26 Hatrix Triangular izati on Array 
after Clock Cycle 9 
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We will now present a different kind of perspective to 
the reader on how the algorithm performs. The geometrical 
interpretation of Givens Rotations is extremely helpful to 
provide tetter insight, A matrix can be considered as repre- 
senting a set of vectors (as many as the number of columns) 
in a n-dimensional space (n equals the number of rows) . If 
the matrix has only three rows# their columns are vectors of 
a three dimensional space, and the elements of the first row 
are the components over the x-axis of the column vectors. 
The rotation operation does not rotate the vectors. The 
reference frame (coordinate axes) is the one that is 
rotated, and, as a consequence, the description of the 
vectors in terms of their components with respect to the new 
reference becomes different. When a matrix is rotated to 
become an upper triangular, the first column is transformed 
in such a way that orly element (1,1) will be nonzero. This 
means that the first column vector is positioned over the 
x-axis in the new reference frame. As the vector is the same 
as before, the numerical value of the new element (1,1) must 
be equal to the length of the vector. Element (1,2) of the 
rotated matrix, for example, is the x-component of the 
second column vector in the new reference frame. Since the 
relative position of both vectors is unaltered, the new 
value of that element must be equal to the projection of the 
second column vector over the first column vector, since 
this last one is now along the new x-axis. The geometrical 
interpretation can be extended to all elements of the array. 

We will track tie build up of the numerical values of 
cells 09 (that stores element (1,1)) and 03 (that stores 
element (1,2)). This allows us to study the operation of 
both tjpes of cells used in this algorithm (cell 09 is an 
outer cell and cell 08 is an inner cell) . It will also 
provide a comparison between the geometric interpretation 
presented above and the algebraic values shown in graphics. 
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Figure 4.27 Givens External Processor Cell 

T)e will start with cell 09. From Fig. 4.27 we see that 
this cell receives the elements of the "nonrota ted” first 
column vector, computes the rotation angle (calculating its 
outputs cosine c and its sine s) , and stores the "rotated" 
x-component r of the first vector. The other components are 
obviously zeros and correspond to the other elements of the 
first column (cells of the lower array that are not shown 
because their contents are always zeros) . The rotation 
angle information is passed as output to cell 08 to be used 
to "rotate" the second vector. Table I shows how the param- 
eters r, c, and s of this cell respond to the external 
inputs. These values can be verxfied with the help of the 
cell algorithm presented in Fig. 4.27 . The description of 
the operation can be followed through the seguence of 
Figs. 4.28 to 4.30. 
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TABLE I 

Time Description of Outer 


n 

Cell Operation 


Cycle 


Input z 


Output 


c 


Output s 


Residue r 


0 


0.0 (black) 


1.000 




0.000 


0. 000 


1 


2.0 (blue) 


0.000 




1.000 


2.000 


2 


5.0 (red) 


0.371 




0. 928 


5.385 




3.0 (green) 
0.0 (black) 


0.874 




0.487 


6.164 1 




1.000 




0.000 


6.164 

. 



At clock cycle 0 no input has been received and so there 
is nc-thing to rotate. The rotation angle is zero (c=1.0 and 
s=0.0). At cycle 1, the first element is received. The first 
element is the old x-component of the first column vector. 
As it is over the x-axis, the reference does not need to be 
rotated now. However, when this occurs, because of the way 
the algorithm is implemented {we will see the reason when 
analysing cell 08) , an angle. 90 degrees is informed to cell 
03 (c=0.0 and s=1.0) although the value r stored in cell 09 
is 2.0, egual to the input. At clock cycle 2, the 
y-component of the old first column vector is entered. Now a 
rotation is necessary to keep the x-axis of the reference 
frame over the first column vector. The rotation angle is 
computed (c=0.371 and s=0.928) and the reference frame is 
rotated (about the z-axis). The new x-component (stored in 
r) is the sguare root of the summation of the squares of the 
old X and y-components, that is 5.385 (corresponding to the 
projection of the first column vector over the x-y plane) . 
At cycle 3, the z-component is entered. As the last rotation 
resulted in a vector over the x-axis, the combination of 
this with the just entered z-component will result in a 
vector in the x-z plane deviated from the x-axis because of 
the newly arrived z-component. Again a rotation is necessary 
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I 

I 

Figure 4.28 Beference Fraae before Rotations 

to keep the x-axis over the first column vector. The new 
x-component is 6.164. At clock cycle 4 a black datum is 
received (a zero value) and that freezes the contents of 
cell 09 with its final value. No more rotations are required 
and the output of the cell informs rotation angle zero 
(c= 1.0 and s=0. 0) . 

New we will follow the build up of the contents of cell 
08 (the right side neighbor cell of cell 09) . Cell 08 stores 
element (1,2), the x-component of the second column vector. 
As done before, we present table II with inputs and values 
stored in this cell at each clock cycle. In this case we 
will disregard the numerical output of this cell because we 
will concentrate on the build up of the residue r in this 
cell. From Fig. 4.31, this cell receives as input the rota- 
tion angle (inputs c and s) , by which the reference frame 
has changed, to be used to compute the new x-component of 
the second column vector (to be stored at r) . The other 
input refers to the elements of the "nonrotated" second 
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Figare 4.29 Beference Frame after First Rotation 
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Figure 4.30 Reference Frame after Second Rotation 
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Pigure 4.31 Givens Internal Processor Cell 

column vector (input z) . The outputs of this cell are the 
rotation angle (c and s are passed as output to the right 
neighbor cell 05) and rotation information to the neighbor 
cell immediately belcw cell 08 to ••rotate” the other compo- 
nents of the column vector as a result of the reference 
frame rotation. 



The operation of this cell is more difficult to visu- 
alize. Figure 4.32 will be used for the explanation. It only 
shows the x-y plane. Suppose the second column vector is A, 
as seen in that Figure. The old reference frame is xl-yl. 
The ccmpcnents of A with respect to that frame are ax and 
ay. In our numerical example, ax=4.0 is the first input 
z(i) to the cell. The second input z (i) to the cell, on the 
following cycle, is ay=7.0. Suppose the reference frame is 
rotated about the z-axis to position x2-y2 to keep the 
x-axis along the first column vector, which is shown as B. 
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TiBLI II 

Time Description of Inner Cell Operation 



1 



Cycle Input c 



0 

1 

2 

3 

4 

5 



1,000 

1.000 

c.ooo 

0.371 

0.874 

1.000 



black) 

black) 

(blue) 

‘red) 

g reen) 
lack) 



Input s 

0.00 0 (black) 
0.000 (black 
1.000 i blue) 
0.928 (’red) 
0.487 green) 
0. 000 (black) 



Input z Residue r 



0 , 0 
0 . 0 

4.0 

7.0 
0 . 0 
0.0 



’black) 

I black) 

’blue) 

'red) 

green) 

[black) 



0.000 
0.000 
4.000 
7. 985 
6. 976 
6. 976 



L 



J 




rigure 4.32 Gecaetric Interpretation of a Rotation 

Instead of evaluating the components of A in this new frame, 
let us study the effect over its components ax and ay. In 
this new frame, ax will be decomposed into components axx 
(over the new x-axis) and axy (over the new y-axis) . 
Similarly, ay will be decomposed into components ayx (over 
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the new x-axis) and ayy (over the new y-axis) . The component 
of A over the new x-axis is the summation of axx and ayx. 
This will te the new element (1,2). As seen in Fig. 4.32, 

axx = ax . cos (alf a) and ayx = ay . sin (alf a) 

and as element (1,2) is 

r = axx + ayx 



it results 



r = ax . ccs(alfa) + ay. sin(alfa) 

As ax was the previous r and ay is the newly arrived input 
z (i) , in computer algorithmic language we have 

r = c(i) * r + s(i) ♦ z (i) 

as in Fig. 4.31 . Substituting the values of clock cycle 2 
into this equation, we find 

r = 0.0 ♦ 0.0 + 1.0 ♦ 4.0 

r = 4.0 

and this can be checked against the above table. At clock 
cycle 3, the values give us 

r = 0. 371 * 4.0 + 0.928 * 7.0 

r = 7. 985 

that also checks against the above table. Doing the same for 
clock cycle 4, we get r=6.976. This gives the final value of 
the x-ccmponent of the second column vector. The rctaticn, 
however, will affect all other components. Let us describe 
how this occurs. The component of A over the new y-axis is 

z (o) = ayy - axy 

z (o) = ay . cos(alfa) - ax . sin (alf a) 
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and in algoritlimic language, 

z (o) = c (i) * z (i) - s (i) * r 

This information z(c) is passed to the cell immediately 
telow to generate the new y-component of the second column 
vector. 

A further point to be noticed is that if the rotation 
angle input to an inner cell is zero (c=1.0 and s=0.0), the 
cell will not change the stored value at r. This is why the 
rotation angle is informed as 90 degrees, as mentioned 
previously, when the first element is received at cell 09. 
This triggers cell C8 to receive and store the z input at 
the following clock cycle. At the end of clock cycle 02, the 
z=4.0 input is stored. No rotation is performed, so r=4.0. 
The following clock cycles impose the modification of the 
contents of this cell because of the changes in the refer- 
ence frame. As soon as the input z freezes (at clock cycle 
5), the residue r cf the cell also freezes. It can be 
observed, if the outputs of cell 09 and its inputs to cell 
08 are compared, the transmmission delay of one clock cycle 
from a cell to another. It can also be noticed that the 
inputs to cell 08 always have the same color. This simula- 
tion was purposely designed this way to show the need for 
sinchronization between the wavefronts that propagate 
through the array. 

Certainly we cannot present all this complexity with the 
simulator. The algorithm of Givens Rotations is the most 
complex that we have traced in our survey. However, the 
simulator has been a fundamental tool to perform a numerical 
study and to verify the correctness of the algorithm. 
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D. BACK SDBSTITOTIOB 



fle will use the results of the previous section as input 
data for the simulation described in this section. This way, 
with these two sections, we can go through the complete 
problem described in the Appendix "A Numerical Example for 
Matrix Triangularization”. That is, to search for the vector 
X in matrix equation AX=B. The process of Back Substitution 
is described in [Ref, 7: page 24], 

One of the cells used in this algorithm is shown in 
Figure 4.33 . The above reference presents the cell shown 
in Fig. 4.33 as a circular shaped cell named Back 
Substitution Cell. Since SYSGRAS would take too much time to 
draw circular shapes, an octogonal shape is used instead 
(see cell number 05 in Figure 4.37). 

The other type of cell used in this algorithm is the 
same Inner Product Step Processor that was presented in 
Fig. 4.6 to do Ma trix-Matrix Multiplication. This is an 
interesting aspect of systolic arrays: many alg o ri thms that 
appea r in the l iteratur e are implemente d wi^ ;Uie s ame types 
of cells. However, the geometry of the array, in this algo- 
rithm, requires the cell to be square shaped. Although the 
cell is functionally the same as shown in Fig. 4.6, we 
present it again in Figure 4.34 to match the way Fig. 4.3*^ 
represents it as part of the structure (see cells 04 and 
02 ) . 

The mathematical background for solving a triangular 
linear system involving a lower triangular array has been 
presented by Kung in [Ref. 1: page 19]. We will adopt here 
his explanation. 

Suppose the system equation to be solved is 
EX = D 

where E = (r(i,g)) is a nonsingular nxn lower triangular 
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Figure 4.33 Back Suistitution Main Cell 
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Figure 4.34 Inner Product Step Processor Cell 



matrix and D is a a n-vector, both being given. To coirpute 
the vector X, we can use Forward Substitution as follows: 

k := 1 

y(i,1) := 0 
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repeat 

y(i,k*1) := y(i,k) + r(i,k) . x{k) 
k ;= k + 1 
until k = i 

x(i) := (d (i) - y(i,i)} / r(i,i) 

Ihe above algorithra can be used to calculate the 
elements of vector X in the sequence x(1), x(2), ... and so 

on. This algorithm can be implemented in a systolic array as 
will be shown. Y = (y(i»k)) is a vector of partial results 

that allows the recurrence to build up. 

In our case, since the interest is in Back Substitution, 
we will simply enter the data into the systolic array in an 
crder reverse to that proposed by Kung in his paper. Let us 
make this point more clear. We do Forward Substitution when 
we have a lower triangular array. In this case we solve 
first for x(1), next for x (2) and so on. Following the 
sequence proposed by Kung, vector X should be pumped into 
the systolic array as x(1) , x(2) and x (3) . The same for 

vector QB. Since here we are doing Back Substitution, we 
enter the vector X in a reverse sequence x (3) , x(2), and 

x(1) . QE is also entered the same way. The way matrix R 
being pumped into the systolic array is modified with 
respected to that proposed by Kung. For instance, the first 
of its elements to be pumped into the systolic array is 
r(3,3) {see cell 10 in Fig. 4.36) which is required to 

compute x(3). In the Forward Substitution Method, the first 
element pumped should be r(1,1) to compute x(1) . The ether 
elements of R are rearranged accordingly. 

The system equation to be solved in this simulation is 

RX = QB 

where E is an upper triangular matrix and QB is a n-vector 
that resulted from the transformation seen in the previous 
Section. 
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Figure 4.35 Data Arrangement for Back Substitution 



Figure 4.35 presents the arrangement of the data with 
respect to the systolic array. The data elements are drawn 
in such a way that the required data synchronization to 
perform the operation is evident. Vector QB and matrix E 
actually carry data into the systolic array for processing. 
Vector Y enters the array bringing no data. Its elements are 
all zeros at the moment represented in Fig. 4.35 . Its 
values are modified as it flows through the array. Vector X 
is generated into cell 05, the Back Substitution Cell, and 
its elements are pumped through the array in a direction 
opposite to that of vector I. As they go through cells 04 
and 02, they combine with elements from matrix R to build up 
the elements of vector Y. Finally, X will come out with its 
final value when it leaves the array, from cell number 02 of 
Fig. 4.35. 

Figures 4.35 and 4.36 should be compared. This last one 
presents the same arrangement as the former, but it shows 
how the picture appears on the screen. Matrix E is on the 
upper block of the cells (cells number 07 to 21) (Refer to 
Appendix C to compare the numerical values) . This block does 
not belong to the array itself and is used only for presen- 
tation of the data. The systolic array is represented by 
cells C5, 04, and 02. The elements of vector QB are intro- 
duced from its left side (at cell 06). The vector of 
partial results Y, a string of zeros separated by one clock 
cycle delay, is pumped in from the right side (at cell 03) . 
The output that will collect the solved elements of vector X 
is represented by cell 01. Attention to the fact that the 
numbers that are displayed at cells 04 and 02 are the values 
assumed by the elements of vector Y as they flow leftwards 
through the array. The number displayed in cell 05 is the 
element of vector X being computed and that in cell 01 is 
the element of X being pumped out to the external world. 
Another point to notice is the way the bidirectional 



79 



connections between cells 05 and 04, and 04 and 02 is imple- 
mented by SYSGRAS. Only one bidirectional arrow is used. 

Re decided to associate the elements of each column of 
the matrix R (third column in blue, second in red, and first 
in green) and the corresponding row elements of vectors X 
and QB with the same color. (e.g. the element number 3 of 
the QE will interact only with the column number 3 of the 
matrix R) . Cell processor number 5, the so called Back 
Substitution Cell is shaped differently, as mentioned 
earlier, to emphasize its special purpose in the systolic 
array. Ihis cell processor computes the elements of vector X 
from the received data and pumps them out to cell 04 to be 
utilized for calculation of the previously referred partial 
results. When the X elements complete their trip through 
the array, they are pumped out at cell 01. 

Ihe manner in which color is used to provide information 
is now described. Ihe elements of vectors QB, X and Y, as 
well as matrix R, have primary colors (red, blue and green). 
Rhen they meet elements of different primary colors, the 
cell where this meeting takes place assumes a secondary 
color that is the result of that combination. The elements 
of vector Y, for example, can be seen with their original 
color in cell 03, before mixing up with others. During their 
trip through the array they keep that color, but as a result 
of their combination with different color elements, the cell 
where they might be may present ‘ different secondary colors. 
However, we should keep in mind that only primary colors 
travel from a cell to another. Secondary colors are static. 

To make these points more clear, we will track the 
formation of elements x{3) and x (2) . Since R is an upper 
triangular matrix, we have 

x(3) = gb(3) / r(3,3) 

Substituting numerical values, as shown in Appendix C, 
x(3) = -10.593 / 0.842 = -12.571 
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Figure 4.36 Back Substitution Array Initialization 
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As it has teen pointed out, the Back Substitution Cell, cell 
number 05 in Fig, 4,37, is responsible for computing the 
elements of X, That Figure presents the computation of 
element x(3). Element gb(3) was pumped in from cell 06 (see 
Fig. 4,36) and element r(3,3) from cell 10 (see same 
Figure), It can also be seen in Fig, 4,37 that x(3) is blue 
since it was formed bj the combination of gb (3) and r(3,3), 
both blue (remember that the third column of R has been 
coded blue). After its computation, element x(3) is pumped 
through the array to cell 04 (clock cycle 2, Fig. 4.38) , to 
cell 02 (clock cycle 3, Fig, 4,39) and finally to the 
external world (clock cycle 4, Fig. 4,40) at cell 01. During 
this trip, its value is used for computation of the elements 
of vector Y which are necessary for computation of the other 
elements of vector X, Now let us track the formation of 

X (2) , 

From algebra, we have 

X (2) = (gb (2) - x(3).r(2,3)) / r(2,2) 

The partial result x(3),r(2,3) is computed at clock cycle 2, 
in cell 04. This cell receives x (3) from cell 05 (computed 
at clock cycle 1, see Fig, 4.37) and r(2,3) from cell 08 
(see Fig. 4.37), This partial result is called y (2) (color 
coded red because it will contribute to the formation of 
X (2) , which color code is red) . This time, y (2) , being orig- 
inally red, appears in cell 04 as magenta because the 
contribution of y(2) is activated blue when it interacts 
with blue r (2, 3) and blue x(3). At this point of the compu- 
tation all necessary data to compute x (2) is available. 
Element y (2) is stored in cell 04, element gb (2) is ready to 
enter the array (see cell 06, in Fig. 4.38), and so is 
element r (2,2) (see cell 10 of Fig. 4.38). At clock cycle 3 
these elements are pumped into the Back Substitution Cell 
(cell 05) (see Fig. 4.39) and the computation of x (2) takes 
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place. Substituting numerical values (as seen in these 
cells) into the above equation it becomes 

x(2) = (-0.566 + 11.538) / 4.042 = 2.714 

Element x(2) assumes color red because all factors that 
contributed to its formation had that color code. After its 
computation, x (2) is pumped through the array via cell 04 
(clock cycle 4, Fig. 4.40), cell 02 (clock cycle 5, 
Fig. 4.41) and finally pumped out to cell 01. During this 
trip, similarly for x(3), it contributes to the formation of 
y(1), another element of the vector of partial results. This 
will be required to the computation of x(1). 

The sequence of pictures from Fig. 4.36 to 4.44 displays 
the whole computational process in the systolic array 
according to the algorithm. 

E. FDETHER EXPLORATICHS 

The potential of SYSGRAS goes far beyond the implementa- 
tion of systolic algorithms. Presently, because of pratical 
importance, extensive research is being carried out on the 
investigation of faults in the actual systolic devices 
[Ref. 11 ]. How to circumvent those faults and which effect 
they do have on the results are some of the subjects under 
study. SYSGRAS can be used as a valuable tool in performing 
such studies, because of the simplicity of its interface 
with the user and because of the possibility of inclusion of 
possible subroutines to support the algorithms. To emphasize 
its versatility, we address the fact that a processor cell 
may have a large number of input/output ports (although it 
is presently set up for a maximum of four) . This allows a 
large number of connections with other cells or the outside. 
This characteristic car be used, as example, to inject data 
into a cell that is situated in any position in the array. 
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Figure *1.37 Back Substitution Array after Clock Cycle 1 
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Figure 4.38 Back Substitution Array after Clock Cycle 2 
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Figure 4.39 Back Substitution Array after Clock Cycle 3 
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Figure *1.40 Back Substituticn Array after Clock Cycle 4 
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Figure ^1.41 Back Substitution Array after Clock Cycle 5 
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Figure 4.42 Back Substitution Array after Clock Cycle 6 
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Figure 4.43 Back Substitution Array after Clock Cycle 7 
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Figure 4.44 Back Substitution Array after Clock Cycle 8 
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Ihese data could consist of logical commands that would 
alter the function cf the processor for the selected cell. 
Ihese commands, in a particular application, could "kill” 
the cell, or degrade its performance according to a desired 
need. Another approach, that would reduce the data entry 
volume, would be to design a subroutine to implement a 
defective processor and assign that processor to the target 
cell. Ihe color feature can also be used to analyse the 
effect that faults at a particular cell will have over the 
others. If a color is injected at selected input ports 
(depending on the software of the subroutine that supports a 
processor) , it will spread over the cells that receive data 
from this faulty cell under evaluation, and will display the 
"bad sector" on the screen. 

Many other possible studies related to systolic arrays 
may be considered due to the flexibility of the SYSGEAS 
software. 
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V. THE INTEBA CTIOH ilM THE SIMD LATOB 



A. SCFTllEE REQOIHEHEHTS AHD EBOCEDORES 

Id order to operate with SISGRAS, the user must have the 
following files. 

SISTOI.PAS 

SISICE.FCR 

LOGIN.COM 

JDEK.CCM 

The user, after compiling SYSTOY.PAS and SYSFOR.FCR, 
must link both relocatable codes with SIGGRAPH core, which 
is at CS3202.CORE library. To do this, the command to be 
use d is 

LINK SYSI0Y,SYSF0E, (CS3202. CORE) CORE/LIB 

In order to run, enter the command "RON SYSTCY" . Ihe code 
will address the EAHIEK peripheral system when running, and 
this device should be turned on and ready to operate. 

B. ENTERING A NEW PECBIEM 

When the command "RUN SYSTCY" is entered, the simulator 
starts prompting all questions that must be answered by the 
user to enter the problem. In Appendix D we have recorded a 
session of the Simulator to make the interaction more under- 
standable. It was cur original intention to present this 
recording of the session as a guide to SYSGRAS for the 
Matrix Triangulariza tion problem. However , due to the large 
amount of material that needs to be presented, we decided to 
include a short "dummy" session in Appendix D for clarity. 
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It shows hew to enter the data and how to handle an incor- 
rect input. The simulator permits recovery from entry errors 
and permits display of input data as many times as desired. 
The user has the following choices when dealing with the 
simulator: 

select a new problem run or a previous problem session 
record the session in a separate file for later 
printing 

ability to interrupt the graphics presentation at the 
end of each clock cycle 

It is recommended that the user follow e xac t l y the same 
nomenclature that was mentioned in Chapter III Section E to 
avoid errors that might be difficult to troubleshoot. 



C. REVIEW OF A PREVIODS SESSION 



We have included in Appendix E a recording of a session 
with SYSGEAS for the purpose of reviewing a previous 
session. The user must take care to rename the file which 
contains the data from previous session into a file under 
the name SYSAEEAY. HEM . This file must be the most recent 
version generated by the VAX VMS Operating System. 

The following files with data from previous sessions are 
already available, and may be run upon request from the 
user : 



file DIVDIFF.HEM, with recording of simulation anal- 
ysed in Chapter IV, Section A. 



file HDLTPIY.HEH, with recording 
ysed in Chapter IV, Section 3. 



of simulation anal- 



file GIVEHSW.MEM, with recording of simulation anal- 
ysed in Chapter IV, Section C. 
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- file BAKSOBS.MEH, with recording of simulation anal- 
ysed in Chapter IV, Section D. 

D. BECOBDIBG OF COHPDTED DATA AND FEINTING ODT 

If the user has asked for this facility, SYSGEAS will 
write all computed results of each cell at each clock cycle 
into a file named SYSPRINT.DOC, which can be printed out 
after the end of the session. 
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71. TBZ SIHO LATCH HAINTE NAHCE 



A. STBOCTUBE OF THE SIHDLATOH 

The SYSGRAS has teen designed in two layers. The upper 
layer that interacts with the user is in PASCAL, and so it 
is guite portable. In fact, due to the use of the 
"OTHERWISE" feature offered by the "CASE" command in th€ VAX 
PASCAL CCMPILEE, a few of its case structures have to be 
slightly modified before it can be installed in another 
machine. The lower layer is constructed in FORTRAN 77 and 
it is compiled with VAX FORTRAN. This layer interfaces with 
the Graphics Package SIGGRAPH that is presently available at 
NPS on the VAX 750 machine. It has been set up for presenta- 
tion on the RAMTEK 9400 system, and so it is not portable. 
Its structure, however, can be modified to interface with 
another graphics package with characteristics similar to 
those of SIGGRAPH. 

The PASCAL layer is presented in Appendix A and the 
FORTRAN layer is in Appendix B. 

B. MCDUIAR DESIGH ASPECTS 

The design of SYSGRAS has followed the philosophy of 
modular design. The data structure has been designed to 
match the abstract idea of a systolic array and its multiple 
features, subsequently it can be easily identifiable by the 
program user. The basic element is the record "NODE". This 
element keeps all the information regarding each cell. The 
whole array is a collection of NODES’s, and these are assem- 
bled into a higher level hierarchy by the record "GRAPH" 
which contains all information about the whole array at each 
clock cycle. 
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lie part of the program that is most relevant tc the 
user refers to the cell processor support routines. If new 
types of cells need tc be simulated, new support subroutines 
must he added to the existing set. This can be done without 
complete knowledge of the program implementation because of 
the modularity of the design. Now we will refer specifi- 
cally to this kind of software maintenance. 

C. DESIGNING AND INSTALLING A NEB PROCESSOR SDPPOBT 

SOBRCDTINE 

The cell processor support subroutines modify the global 
data structure represented by the variable G of type GRAPH. 
The existing set of such subroutines are under the comment 
titled ’’library routines for cell processors" in Appendix A. 
If a new one needs to be created and added, it should 
conform with the pattern seen in examples. The subroutine 
implementation must he placed between the statement "with 
G.NODES(.I.) do begin" and the statement 

"COLCB_PIOCESSING (G, I) " . This last command calls a subrou- 
tine that will compute the color of the cell as a result of 
the combination of the different primary colors of the input 
data. 

The following steps must, therefore, be followed to 
introduce a new cell subroutine: 

i. increase the constant NDflBER_OF_ROUTINES . 

ii. modify the enumerated type PROCESSOR_TYPE to include 
another routine reference name (e.g. R0UTINE_9) . 

iii. modify procedure DEFINE_NODES to include references 
to the new subroutine. This should be done in such a 
way that the new routine is referred the same way as 
the others. 

iv. modify procedure OOTPOI_ARRAY the same way as above. 
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V. modify procedure COEEECT_NODE as above. 

vi. modify the main program statement "case PEOCESSOE 
cf” as above. 

vii. place the body of the new procedure next to the 
existing procedures to conform with the pregram 
structure. 

Hints for the design of the new procedure: 
the type NODE defines the cell data structure. 

- the number of input/output ports in a cell can be 
modified by changing the constant CONNECTIONS in the 
main program. If this is done, the constant HAXlIKKS 
also must be modified as instructed in the program 
text comments. 

The present implementation of SYSGEAS restricts the 
maximuB number of cells in an array to 23, in order to 
achieve a faster computational speed. If desired, that can 
be modified by altering the main program constant MAXNCDES. 
If this is done, the constant MAXLINKS must also be modified 
as instructed in the program text comments. 
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VII. CONCIDSIO NS 

A. ESTAELISHIHG COHPARISOH PIBAMETEES 

The simulations that have been studied in Chapter V 
provided us a reasonable tool to use for evaluation of algo- 
rithms implemented in systolic arrays. Important factors 
that must be considered in this kind of an evaluation are: 

i. use of the same type of processors in other algo- 
rithms 

ii. average percentage of cells involved in effective 
computation per clock cycle 

iii. degree of homcgeneity of cells 

iv. degree of the usage of pipelining 

•V. possibility of modular expandability 

vi. required number of external connections to each cell 

vii. cell complexity 

Unfortunately not all these factors can be determined by 
using this tool. Sub jectiveness also has to play a role in 

this evaluation. 

E. CCHCIOSIONS ABOni ALGORITHMS UNDER ANALYSIS 

The i nplementa ticn of an algorithm in a systolic array 
may provide a greater calculation speed, but a cost was paid 
to get that achievement. The cost can be evaluated in terms 
of the complexity of the design, the difficulty in implemen- 
tation, the manufacture problems that can result from a 
particular array configuration, etc. Ke want to make sure 



99 



that this cost is not excessive and, ideally, to be minimuni. 
These are considerations that require criteria of evaluation 
to be established. Instead of trying to come up with a 
criteria that could be the subject of lengthy discussion, 
the above mentioned factors are used to discuss the effi- 
ciency of the algorithms with a more qualitative than quan- 
titative approach. 

The first factor we want to establish is that concerning 
the average percentage of cells effectively involved in 
active computation per clock cycle in each algorithm. For 
each algorithm we consider the number of required cycles to 
run till completion, the number of physical cells in the 
systolic array, and the number of cells involved in active 
computation in each clock cycle. This last factor corre- 

sponds to the number of cells shown on the screen with a 
color other than black. Thus we do have: 

Matrix Triangular ization : 
number of cycles = 9 
total number of cells = 9 
percentage of computation / cycle = 

= 1/9 X 1/9 X (1+2+4+5+6+5+3+1+0) = 

= 0.333 

Performing similar calculations for the other algo- 
rithms, we obtain the numbers shown up in Table I. 
Eefinitely, the Matrix Triang ularization has a longer ’’duty 
cycle” per cell and this means less waste. 

The number of types of cells required for each algorithm 
is easily seen in the pictures shown in Chapter IV. Mo dula r 
expandab i l it y is another important feature. It implies the 
possibility of interconnecting chips (each one embedding a 
whole systolic array) in a cascade fashion or seme ether 
arrangement where the chips are used as smaller parts of a 
higher hierarchical arrangement. Matrix Triangularization 
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and Back Substitution have restrictions in that respect due 
to the fact that their cell arrangement is not symmetrical 
and such an expansion would require different types of 
chips. 

"With respect to these factors, we can summarize in the 
following table; 



TABLE III 

Comparison of Algorithm Implementation Evaluation 

Factors 



FEATURE 



Average 
percentage of 
duty cycle 
ccmputang 
per cell 



Number of 
cell types 



Nodular 

expandability 



MATRIX 


BACK 1 


DIV. 


TRIANG. 


SUBST. 


DIFF. 


0.333 


0.250 


0.250 


2 


2 


1 


R 


R 


Y 



MATRIX 

MUIT. 



0.153 



E = restricted 
Y = yes 



The number of external connecti ons required by each cell 
is pehaps the most demanding factor existing in pratical 
implementations. The normal connections are power and clock 
lines. Some algorithms like Matrix Multiplication need only 
these two. Some others have synchronization and data feeding 
problems that can only be solved with the addition of extra 
external connections to the cell. This feature is somehow 
related to the degree of pipelining that can be achieved. 



The Matrix Triangularization and the Divided Differences 
algorithms can be pratically implemented if there is a flag 
signal that can trigger a pumping out mechanism to send the 
data frozen in each cell after the last clock cycle. To 
recover the data and display it externally, additional 
connections are needed in each cell. This increases the 
total cost of the chip and complicates the problem of 
modular expandability. Since a design goal in every VLSI 
implementation is to reduce the number of external connec- 
tions, this is a key factor in the design effort. 
Pipe l ining possibility is also affected because the pumping 
out operation is not part of the normal algorithm cycle. It 
is an additional burden that may require the interrupticn of 
the data pumping intc and the pipelining will have to be 
restricted to data of the same problem. Data from different 
problems can not coexist at the array at the same time. A 
calculation of a prctlem has to finish in order to start 
another. In this sense, if we examine the Matrix 
Multiplication algorithm, we will conclude that true pipe- 
lining can be achieved. No data flow interruption occurs 

because the results are not kept frozen in the cells, tut 

are moved as part of the data flow. 

The factor of cell complexity is important not only 

because cf the possible high cost of the hardware, but also 
because cf the time that may be required for a clock cycle 
to be executed. If a longer time is required, the clock 
speed has to be kept low and the efficiency of the 

processing decreases. Simple operations are always a goal on 
the algorithm design because they result simpler and faster 
hardware. This is why another algorithm for Matrix 
Triangularization without Square Roots has been suggested. 
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C. SDGGESHOHS FOB FDTOEE MODIFICATIONS ON SYSGRAS 



W€ are conscious that we have not achieved an optiial 
design with our simulator in terms of performance. It could 
be improved if the graphics implementation were more effi- 
cient. This would speed up the interaction with the user and 
make it more attractive. Another point that cculd be 
improved is the user interface. The amount of information 
that is required from the user requires an effort that might 
be minimized if other interface techniques were used, such 
as combined use of mouse or lightpen with the keyboard 
input . 

Additional points that could be modified to enhance the 
presentation at the screen are: 

Elimination of non significant zeros from the cell 
display (such as turning 000.400 into 0.4). 

Change of the bidirectional arrow that represents 
ccmmunications in both directions between cells into 
unidirected arrows, one for each link between cell 
ports. This would make the cells appear on the screen 
the same way they are represented in the literature. 
An example of the birectional arrow can be seen in 
Fig. 4.36, in Chapter IV, connecting cells 05 and 04, 
as well as 04 aid 02. 

D, THE BOIE OF THE SINOLATOR 

Our goal was to contribute to the understanding of the 
implementation of algorithms in systolic arrays. This led us 
to the realization of some of the inherent problems that 
appear in this field. The complexity of the data interac- 
tion in the space-time frame is considered to be the 
greatest obstacle in the understanding process. To decide 
on the approach to adopt to handle the problem is also 
difficult. In response to that, we designed a tool whose 
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task is to provide a software implementation of the systolic 
array. Ihe user is left free to concentrate on the algo- 
rithm. fle have shown the use of this tool on the study of 
some algorithms. This gave us the opportunity to realize the 
power of a systolic array in the process of performing 
calculations. Aspects such as computation time, memory 
requirements and I/O interactions are minimized. The mapping 
of an algorithm onto a systolic array certainly represents a 
major difficulty. The main role of our simulator is not to 
help in this mapping, but it can be used in the validation 
phase of the algorithm design. The greatest contribution 
that we see in SYSG5AS is its versatility. The user can 
model an ideal environment for the algorithm or an environ- 
ment contaminated by problems in the underlying hardware. 
This certainly can help in achieving better results while 
presenting a clear idea of the robustness of an algorithm. 
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APPENDIX A 

SXSTOLIC ABBAY SOFTHAEE SIMOLATION PEOGBAM 
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var M,I: integer; 
tegin 
writeln; 
writeln; 

writeln ('YOU Will BE PROMPTED TO CONSTRUCT A SYSTOLIC ARRAY') 
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writelQ (' ENTER SCREEN COORDINATES:*); 
writeln(* ENTER X (integer from 1 to 8) 

readln (G. NODES (.CELLNUHB.) . COORDX) ; 
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G. ARCS (.1, J, K,I.) :=FALSE 
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C0L0R: = CCI0R+C0DE ( 
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begin 

with G. NODES (. 
begin 



OUT(. 1.) : = INE (. 1.) ; 

0UT(.2.) :=INF(.2.) ; 

OUT (.3. ) • DATDI1: = INP (.1.) .DATUM + INP (.3.) .DATOM- 
INP (. 2. ) .DATUM*MEM (.1.) ; 

OUT (. 3.) • SPECTRUM: =INP (. 3. ) • SPECTRUM; 
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else begin 



OUT (. 1.) .DATUM: =MEM (. 1.) /SQRT (SQR (MEM (. 1 .) ) 

+ SQR (INP (. 1. ). DATUM) ) ; 

OUT (. 2.) .DATUM:=INP (. 1. ) .DATUM/SQRT (SQR (MEM(. 1.) ) 
+ SQR(INP (.1.). DATUM) ) ; 

MEM (.1.) : = SQET (SQR (MEM (. 1. ) ) +SQR (INP { . 1 . ) . EA TUM) ) 
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(* nomenclature is 
A (i) ==> 3KP{. 

B(i) ==> INP(. 
C(i) ==> INP (. 
A(o) ==> CUT (. 
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X (i) ==> INP{. 

C (i) ==> INP(. 
S (i) ==> INF (. 



Z (i) ==> INf ( 
X(o) ==> Ot*I{ 
C(o) ==> ODK 
S (o) ==> ODT ( 
Z (o) ==> Ot1 { 
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PROCESSOR <> ROUTINE 0 ) then 



SQUAEE; SQUARED (COLORCODE, COORDX,COORBY) ; 
OCTAGONtOCTGON (COLORCODE, COORDX, COORDY) ; 
d (* case SHAPE *) ; 
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begin 

vriteln (*ENTER ID NUMBER OF WRONG NODE') 
readln (NODEIE) ; 
with G. NODES (.NODEID.) do 
begin 



writeln {* SELECT ITEH CHANGE CODE FROM MENU: 
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PROCESSOR: =EOUTINE 
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begin 

writeln (’ ENTEB NEW ABCISSA*) 
readln (NEWCODE) ; 



COOP.DX:=NEWCODE; 

viriteln (*0K, CHANGE IS DONE*); 
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begin 

writeln ('SELECT YOUR OPERATION CODE; 
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WANI_EEC:=fals€; (♦ to avoid risk of recording twice *) 
speedy review of whole sequence *) 
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writeln (’REVIEW ; II YOU ALSO ASKED FOE RECORDING FOR PRINTING’) 
writelr ('THERE IS ANOTHER RECORD CALLED SYSPRINT.EOC THAT YOU’) 
writelr(’CAN ASK THE OPERATING SYSTEM TO SEND TO PRINTER IF’) 
writeln (’YOU WISH;'); 
writeln ; 



writelL(» IHIS IS THE END OF THE SESSION 

{♦ finish with graphics device *) 

FI KIT ; 

end (* SYSTOLIC_SIHCLATION *) . 
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call DCLNDX (3, 6, 0 . 8, 0. 0, 0 . 8) 
call DCLNDX (3,7 , 0. 8 , 0. 8, 0 - 0) 
call DCLNDX (3 , 8, 1 . 0 , 1 . 0 , 1 . 0) 
call SWINDO (0.0,40 0. 0,0.0,400. 0) 
call CBBSEG (1) 
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X (1) =0.0 
y(1) =0.0 
X (2) =400. 
y (2) =0 . 0 
X (3) =400. 



y (3) =400.0 
X (4) =0.0 
y (4) =4 00.0 
call SPISTY (2) 
call SFNDX (8) 
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y (1) =yc-18.0 
y (2) =y (1) 
y (3) =yc+ 18 .0 
y (4) =y (3) 
call SPISTY (2) 



call SFNDX (color) 
call SLNDX (1) 
call P0LYA2 (x,y,iJ 
EETURM 
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X (5) =XC+8. 0 
y (5) =YC+20. 
X (6)=XC-8. 0 
y (6) =YC+20. 
X (7) =XC-20. 



y (7) =YC+8.0 
X (8) =XC-20. 0 
y (8) =YC-8. 0 
call SPISTY (2) 
call SFNDX (color) 
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write (6,200) coordx, coordy 
200 format (* CEIL COORDINATES ARE X=*,i2,’ AND 



€ndif 

if (numb .It. 0.0) then 
call TEXT (*-*,1) 
call HOVR2 (4. 0,0.0) 
else 
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endif 

if (p (i) .eg. 1) then 
call TEXT (’ 1 ’ ,1) 
else if (p (i) .eg. 2) then 
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SUBROUTINE IDENl (coordx ,coordy ,id) 1 print 

identification label on cell bottom right 






<N 

































q 




c 




q 




c 




q 


04 






























0 ) 




(P 




a; 




(P 




(P 


% 






























x; 








Xi 








X! 


•H 






























-p 








-p 




-p 




x> 




























q 






















>1 


























0 ) 




CN 




m 








in 




vn 






























T— 




T— 




T— 




T— 




T— 




M 


























4 J 






















0 
































» 


oi 




cn 




T» 




CJ’ 


0 




0 


0 






















T— 


0 ) 


CN 


0 ) 


rn 


<v 


rj- 


<P 


in 


CP 


U 


U 


f 


f 




















T— 
























>1 


in 


LD 






0 




































X 


% 


(N 


CM 






>1 














t 






















nd 


0 


1 


1 






% 














cr» 


H 


•H 


H 


•H 


H 


•H 


H 


•H 


H 


•H 


M 


X 


X 


P-i 






U 










0 




CP 


X 




X 




X 




X 




X 




0 


«b 


HD 


nd 






X 


CO 












t 


w 


UM 






w 


04 


w 


U 4 




O4 


0 


0 


M 


M 
















* 






H 




H 




EH 




H 




H 




0 


>1 


0 


0 


















CN 




























0 


0 


0 


0 


<N 


X 




0 




nd 




•H 


rH 


m 


rH 


4-^ 


rH 


4-1 


rH 


4-^ 


rH 


4-< 


CN 


0 


0 


U 


f 


f 


•< 


Q 




T— 




•H 


r- 




H 


•H 


rH 


•H 


rH 


•H 


iH 


•H 


H 


•H 




X 


* 




CM 


m 


> 


;s 




\ 


T— 


1 


II 




(d 




<d 




cd 




(d 




(d 




M 




0 


0 


«“ 


(N 


0 


H 




t— 


nd 




•H 




0 


OJ 


0 


(P 


0 


<P 


0 


CP 


U 


<P 


<D 


sr 


• 


• 


•f 


1 


S 


cn 


Td 


nd 


•H 


•H 








in 




W 




in 




in 




W 


tn 




0 


0 


U 


0 






•H 


•H 


II 


II 


0 


m 




rH 




H 




rH 




•H 




H 


a; 


H 


LO 


ID 


X 


>1 


H 


f — 1 


II 


II 






T— 


’H 




CP 




CP 




<P 




CP 




CP 


4 J 


(0 


II 


II 


II 


II 


H 


rH 


T— 




T— 


CN 


























Pi 




0 


0 


0 


0 


(d 


(d 




nj 




" — 


0 
























•H 


M 


X 




X 




u 


u 


•r1 


•H 






nd 

























163 





P3 








p 














0) 




0) 




0) 




0) 






















-02 










-P 




-P 




-P 




-P 




































CQ 




Ch 




O 












r— 




T— 




T— 




t— 




o 


r“ 




















•- 


% 


cn 


m 


cn 


» 


Crt 




cn 


m 




o 


m 


0) 






00 


0 




0) 


o 




«> 


VO 


t 


m 
















lO 


» 




















t 


— ' 




















cr 




•H 




•H 


H 


•H 


H 


•H 


H 






fc-< 




X 












Px2 










w 


Ui 


w 


Ui 




Oi 


W 




CN 


w 




H 




H 




H 


— 


H 




« 






















> 




MH 


rH 


HH 


rH 


MH 


rH 


MH 


fH 




o 


rH 


•H 


rH 


•H 


fH 


•H 


rH 


•H 


H 




E2 


iH 




(T3 




03 




03 




03 


MH 




(TJ 


0) 


O 


a? 


o 


0) 


O 


0) 


O 


•H 


rH 


O 


(/3 




(/3 




W 




W 




no 


H 




rH 




rH 




fH 




fH 




C3 


03 




0) 




0) 




0) 




0) 




0) 


O 



<i; 

a 

•H 

n 

O 

O 



n 

M 

W 



Q 



rH ^ 
fH -P 
0) O 
O rQ 

td m 

o 

s 

o w 

M 0) 

(t3 



3E 






CN 












O 


•H 




4H 












M 


no 




fH 












UI 


U| 




03 












03 


o 


















o 




r— 












c; 


o 




03 












03 






4-1 














0) 




H 














4c 




03 


Ul 












-p 




% 












03 






>1 












M 


fC 




rH 


P-l 










Q 


a; 






X 












> 




X 


% 












•H 




H 


rH 












Cr^ 






















03 














U 




MH 


fH 










CN 


Q) 




H 


X 










>1 






03 


% 










% 


-P 




% 


•H 










CN 


O 




O 


PH 










X 


P 




>1 














03 






03 










T-" 




CN 


o 


CN 












0 


>1 


X 












% 


-P 


*» 














r— 




CN 


m 


03 










X 




X 


>1 


CN 
















% 


X 




O 


O 


o 








4H 






• 


• 


• 








X 


03 




U) 


tn 


U) 


o 








T~ 




CN 


CN 


CN 


PC 




T— 


(/3 


>1 




1 


1 


1 


Uh 




X 


>1 




cn 


r- 


CN 


T— 


f< 








03 




X 


X 


>1 






CN 




r— 


LO 


■«• 




* 


W 




# 


X 


X 


T— 


o 


o 


o 


:z; 




M 






p- 


• 


• 


• 


H 




03 


P* 


p* 


T— 


o 


o 


o 


Eh 




Cr» 


•»• 


•«• 


t 


to 


LO 


LO 


PD 




0) 


fH 


rH 


fO 


II 


II 


II 


o 




-P 


03 


03 


II 


03 


p 


03 


w 




P 


03 


03 


•H 


r— 


CN 


f— 


PH 




•*H 


M 


Ui 


Ul| 


X 


X 


P-f 


n 


















to 



















O 

r- O 



164 



y2a=50. 0*y2-25. 
xc= (x1 a+x2a) /2. 
yc= (y1a+y2a) /2. 
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endif 

lx = 9.0*cos (alfa) 
ly=9.0*sin (alfa) 
xs=xc-lx/2 . 0 
xf =xc+lx/2. 0 



js=yc-ly/2. 0 
yf =yc+ly/2. 0 

call SLNDX (2) ! make = 1 for black arrow 

call M0VA2 (X£,ys) 
call LINA2 (xf,yf) 
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The second elementary transformation matrix is 



where d = sgrt (5.38518x5. 38518+3x3) =6. 16443 
c = 5.38518/d=0. 87359 
s = 3/d=0. 48666 
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We have started with matrix A and by means of three elementary 
rotations we have transformed it into matrix R,that is, we got 
the transformation QA = R. Note that the elements r(2,1) , 
r(3,1) and r(3,2) haven't been reduced to an absolute zero, but 
their value is a residue due to the round-offs that are 
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Finally we came up with the system represented ty a matrix 
equation that can te easily solved by Back Substitution, 
RX = CB. Approximating elements r(2,1), r (3, 1) and r(3,2) to zero 
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This way we can save time with the calculations. This is the 
scheme adopted when the algorithm is mapped into a systolic 
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